Poster: hank_b Date: Mar 9, 2010 11:49pm
Forum: texts Subject: Re: PDF's with text layer

Unfortunately, we're not currently set up to extract the text layer from contributed PDFs.

Having significant numbers of PDFs with text layers uploaded is a relatively new thing - our processing pathways were all originally designed to work from images, not PDFs. We do now derive from contributed PDFs, but only through the makeshift approach of extracting the images from the PDF and processing them as though they came from scanning a book. That means the OCR results you see (via the "Full Text" link) come from running OCR on your images, so yes, they have the same (limited) level of accuracy as the OCR we produce for scanned books.

Extracting text directly from contributed PDFs is on the to-do list, but I'm afraid I can't give you any estimate of when it will happen.

Hank Bromley
software engineer
Internet Archive

Poster: genet Date: Jan 23, 2011 12:14pm
Forum: texts Subject: Re: PDF's with text layer

Any news on this front? I am preparing more uploads with carefully proofread text layers, and it would sure be nice if they could be used!
Thanks,
-Gene

Poster: hank_b Date: Jan 23, 2011 12:32pm
Forum: texts Subject: Re: PDF's with text layer

Gene-

Unfortunately, no, nothing new to report on using the text layer found within contributed PDFs.

-- Hank

Poster: virtualverse0 Date: Apr 19, 2017 7:03am
Forum: texts Subject: Re: PDF's with text layer

Any news about extracting text directly from contributed PDFs? And about uploading my own OCR XML file (made with Adobe Acrobat Pro DC and carefully revised by hand) to replace the ABBYY OCR XML provided by archive.org?
This post was modified by virtualverse on 2017-04-19 14:03:06

Poster: genet Date: Jan 23, 2011 12:36pm
Forum: texts Subject: Re: PDF's with text layer

Thanks again, Hank!
-Gene

Poster: genet Date: Mar 15, 2010 10:14am
Forum: texts Subject: Re: PDF's with text layer

Thanks, Hank!
I just wanted to be sure that I wasn't doing anything incorrectly.
-Gene